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Abstract: This paper is about optimal estimation of the additive components 
of a nonparametric, additive isotone regression model. It is shown that asymp- 
totically up to first order, each additive component can be estimated as well as 
it could be by a least squares estimator if the other components were known. 
The algorithm for the calculation of the estimator uses backfitting. Conver- 
gence of the algorithm is shown. Finite sample properties are also compared 
through simulation experiments. 



1. Introduction 

In this paper we discuss nonparametric additive monotone regression models. We 
discuss a backfitting estimator that is based on iterative apphcation of the pool 
adjacent violator algorithm to the additive components of the model. Our main 
result states the following oracle property. Asymptotically up to first order, each 
additive component is estimated as well as it would be (by a least squares estimator) 
if the other components were known. This goes beyond the classical finding that 
the estimator achieves the same rate of convergence, independently of the number 
of additive components. The result states that the asymptotic distribution of the 
estimator does not depend on the number of components. 

We have two motivations for considering this model. First of all we think that this 
is a useful model for some applications. For a discussion of isotonic additive regres- 
sion from a more applied point, see also Bacchetti [H], Morton- Jones et al. "s^ and 
De Boer, Besten and Ter Braak But our main motivation comes from statistical 
theory. We think that the study of nonparametric models with several nonparamet- 
ric components is not fully understood. The oracle property that is stated in this 
paper for additive isotone models has been shown for smoothing estimators in some 
other nonparametric models. This property is expected to hold if the estimation of 
the different nonparametric components is based on local smoothing where the lo- 
calization takes place in different scales. An example are additive models of smooth 
functions where each localization takes place with respect to another covariate. In 
Mammen, Linton and Nielsen ^28*1 the oracle property has been verified for the lo- 
cal linear smooth backfitting estimator. As local linear estimators, also the isotonic 
least squares is a local smoother. The estimator is a local average of the response 
variable but in contrast to local linear estimators the local neighborhood is chosen 
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by the data. This data adaptive choice is automatically done by the least squares 
minimization. This understanding of isotonic least squares as a local smoother was 
our basic motivation to conjecture that for isotonic least squares the oracle property 
should hold as for local linear smooth backfitting. 

It may be conjectured that the oracle property holds for a much larger class of 
models. In Horowitz, Klemela and Mammen |19j a general approach was introduced 
for applying one-dimensional nonparametric smoothers to an additive model. The 
procedure consists of two steps. In the first step, a fit to the additive model is con- 
structed by using the projection approach of Mammen, Linton and Nielsen [i^. This 
preliminary estimator uses an undersmoothing bandwidth, so its bias terms are of 
asymptotically negligible higher order. In a second step, a one-dimensional smoother 
operates on the fitted values of the preliminary estimator. For the resulting esti- 
mator the oracle property was shown: This two step estimator is asymptotically 
equivalent to the estimator obtained by applying the one-dimensional smoother to 
a nonparametric regression model that only contains one component. It was conjec- 
tured that this result also holds in more general models where several nonparametric 
components enter into the model. Basically, a proof could be based on this two step 
procedures. The conjecture has been verified in Horowitz and Mammen 2^ 22| for 
generalized additive models with known and with unknown link function. 

The study of the oracle property goes beyond the classical analysis of rates of con- 
vergence. Rates of convergence of nonparametric estimators depend on the entropy 
of the nonparametric function class. If several nonparametric functions enter into 
the model the entropy is the sum of the entropies of the classes of the components. 
This implies that the resulting rate coincides with the rate of a model that only 
contains one nonparametric component. Thus, rate optimality can be shown for a 
large class of models with several nonparametric components by use of empirical 
process theory, see e.g. van de Geer [39j. Rate optimality for additive models was 
first shown in Stone 38[ . This property was the basic motivation for using additive 
models. In contrast to a full dimensional model it allows estimation with the same 
rate of convergence as a one-dimensional model and avoids for this reason the curse 
of dimensionality. On the other hand it is a very flexible model that covers many 
features of the data nonparametrically. For a general class of nonparametric models 
with several components rate optimality is shown in Horowitz and Mammen 21[. 

The estimator of this paper is based on backfitting. There is now a good under- 
standing of backfitting methods for additive models. For a detailed discussion of the 
basic statistical ideas see Hastie and Tibshirani [l8|. The basic asymptotic theory 
is given in Opsomer and Ruppert [s^] and Opsomer 35 1 for the classical backfitting 
and in Mammen, Linton and Nielsen 28| for the smooth backfitting. Bandwidth 
choice and practical implementations are discussed in Mammen and Park 2^ 30[ 
and Nielsen and Sperlich [s^ . The basic difference between smooth backfitting and 
backfitting lies in the fact that smooth backfitting is based on a smoothed least 
squares criterion whereas in the classical backfitting smoothing takes place only for 
the updated component. The full smoothing of the smooth backfitting algorithm 
stabilizes the numerical and the statistical performance of the estimator. In par- 
ticular this is the case for degenerated designs and for the case of many covariates 
as was shown in simulations by Nielsen and Sperlich [33| . In this paper we use 
backfitting without any smoothing. For this reason isotone additive least squares 
will have similar problems as classical backfitting and these problems will be even 
more severe because no smoothing is used at all. Smooth backfitting methods for 
generalized additive models were introduced in Yu, Park and Mammen [42j. Haag 
[ISj discusses smooth backfitting for nonparametric additive diffusion models. Tests 
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based on smooth backfitting have been considered in Haag [16[ and Mammen and 
Sperhch . Backfitting tests have been proposed in Fan and Jiang jl^] . Additive 
regression is an example of a nonparametric model where the nonparametric func- 
tion is given as a solution of an integral equation. This has been outlined in Linton 
and Mammen f24| and Carrasco, Florens and Renault @| where also other exam- 
ples of statistical integral equations are given. Examples are additive models where 
the additive components are linked as in Linton and Mammen [25] and regression 
models with dependent errors where an optimal transformation leads to an additive 
model, see Linton and Mammen [26| . The representation of estimation in additive 
models as solving an empirical integral equation can also be used to understand 
why the oracle property holds. 

In this paper we verify the oracle property for additive models of isotone func- 
tions. It is shown that each additive component can be estimated with the same 
asymptotic accuracy as if the other components would be known. We compare the 
performance of a least squares backfitting estimator with a least squares isotone 
estimator in the oracle model where only one additive component is unknown. The 
backfitting estimator is based on iterative applications of isotone least squares to 
each additive component. Our main theoretical result is that the differences be- 
tween these two estimators are of second order. This result will be given in the 
next section. The numerical performance of the isotone backfitting algorithm and 
its numerical convergence will be discussed in Section [3l Simulations for the com- 
parison of the isotone backfitting estimator with the oracle estimator are presented 
in Section IH The proofs are deferred to the Appendix. 



2. Asymptotics for additive isotone regression 

We suppose that we have i.i.d. random vectors {Y^, Xl, . . . , X^), . . . , (F", X^, . . . , 
X2) and we consider the regression model 

(1) E{Y'\Xl...,Xl)=c + m,{Xl) + --- + md{Xl) 

where mj(-)'s are monotone functions. Without loss of generality we suppose that 
all functions arc monotone increasing. We also assume that the covariables take 
values in a compact interval, [0, 1], say. For identifiability we add the normalizing 
condition 

(2) / mj{xj)dxj=0. 

Jo 

The least squares estimator for the regression model ([1]) is given as minimizer of 

n 

(3) 5](r^-c-^i(xj) 

1=1 

with respect to monotone increasing functions fii^ . . . , fid and a constant c that 
fulfill Hj{xj) dxj = 0. The resulting estimators are denoted as mi, . . . , fhd and c. 

We will compare the estimators rhj with oracle estimators that make use 

of the knowledge of mi for I ^ j. The oracle estimator m'j^ is given as minimizer 
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of 

n 

i=i i^j 

n 

= ^K-(Xj)+e'^-c-M,(Xi))2 

i=l 

with respect to a monotone increasing function fij and a constant c that fulfill 
iij{xj)dxj = 0. The resulting estimators are denoted as fhf^ and cP^. 
In the case d = 1, this gives the isotonic least squares estimator proposed by 

Brunk which is given by 

t 

(4) mi{x['^) =maxminVy(j'V(i-s + l) 

s<i t>i ^ — ^ 

where x[^\ . . . , xj;"-* are the order statistics of X^, . . . , X" and F*^-'^ is the obser- 
vation at the observed point X^^. Properties and simple computing algorithms 
are discussed e.g. in Barlow et al. Q and Robertson, Wright, and Dykstra [36|. A 
fast way to calculate the estimator is to use the Pool Adjacent Violator Algorithm 
(PAVA). In the next section we discuss a backfitting algorithm for d > 1 that is 
based on iterative use of PAVA. 

We now state a result for the asymptotic performance of fhj . We use the following 
assumptions. To have an economic notation, in the assumptions and in the proofs 
we denote different constants by the same symbol C. 

(Al) The functions mi, . . . , nid are differentiable and their derivatives are bounded 
on [0,1]. The functions are strictly monotone, in particular for G{6) ~ 
^T^f\u-v\>s.i<j<d I'lTT-ji'^') ~ it holds G{S) > C5'' for constants C, 7 > 

for all f> 0. " 

(A2) The d-dimensional vector X^ — {XI, . . . , X^j) has compact support [0, l]'^. The 
density p of X"^ is bounded away from zero and infinity on [0, 1]'* and it is 
continuous. The tuples [X'-, Y^) are i.i.d. For j,k = 1, . . . , c? the density .Xj 
of {XI, Xj) fulfills the following Lipschitz condition for constants C,p > 

sup \px^,x,{uk,Uj) - PXk.Xj{vk,Uj)\ < C\uk - Wfcl''. 

0<Uj ,Uk <1 

(A3) Given X^ the error variables e' — — c — mi{Xl) — • • • — md{Xy) have 
conditional zero mean and subexponential tails, i.e. for some 7 > and C" > 
0, it holds that 



E 



exp(7|e*|) 



X' 



< C" 



The conditional variance of e* given A"' = a; is denoted by a'^{x). The condi- 
tional variance of given XI = m is denoted by (Tf{ui). We assume that af 
is continuous at cci. 

These are weak smoothness conditions. We need (A3) to apply results from 
empirical process theory. Condition (Al) excludes the case that a function rrij has 
fiat parts. This is done for the following reason. Isotonic least squares regression 
produces piecewise constant estimators where for every piece the estimator is equal 
to the sample average of the piece. If the function is strictly monotone the pieces 
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shrink to 0, a.s. If the function has fiat parts these averages do not localize at 
the flat parts. But in our proof we make essential use of a localization argument. 
We conjecture that our oracle result that is stated below also holds for the case 
that there are flat parts. But we do not pursue to check this here. It is also of 
minor interest because at flat parts the monotone least squares estimator is of 
order to^^ — nij — op{n^^/^). Thus the oracle result fhj — m^^ = op{n^^/^) then 
only implies that fhj — mj = op(n~^/^). In particular, it does not imply that fhj 



and fh'^^ have the same asymptotic distribution limit. 

For d = \ the asymptotics for fhi are well known. Note that the estimator fhi for 
d = 1 coincides with the oracle estimator mf ^ for d> 1 that is based on isotonizing 

— c — m2(^|) — • • • — md{X^) — mi{Xl) + in the order of the values of XI 
{i — 1, . . . ,n). For the oracle model (or for the model ([1]) with d = 1) the following 
asymptotic result holds under (Al)-(A3): 

For all xi G (0, 1) it holds that 

fhf^ixi) - mi{xi) = Op(n-i/3). 
Furthermore at points xi £ (0, 1) with m'i{xi) > 0, the normalized estimator 

ai(xi)'''-^fni(xi)'^''^ 

converges in distribution to the slope of the greatest convex minorant ofW(t) + 
t^ , where W is a two-sided Brownian motion. Here, pi is the density of X\ . 

The greatest convex minorant of a function / is defined as the greatest convex 



function g with g < f, pointwise. This result can be found, e.g. in Wright |4l| and 



Leurgans [23|. Compare also Mammen [27[. For further results on the asymptotic 
law of mi^{xi) — mi{xi), compare also Groeneboom [13, 

We now state our main result about the asymptotic equivalence of fhj and fh'^^. 

Theorem 1. Make the assumptions (A1)-(A3). Then it holds for c large enough 
that 

sup \fhj{xj) - fh'j^'{xj)\ — op{n^^^^), 

n-l/3<j;j.<l-„-l/3 

sup \fhj{xj) - fhf^'{xj)\ = op{n^'^^^{\ogny) 

0<Xj<l 

The proof of Thcorem[l]can be found in the Appendix. Theorem[l]and the above 
mentioned result on fhf^ immediately implies the following corollary. 

Corollary 1. Make the assumptions (A1)~(A3). For xi G (0,1) with m'i{xi) > 
it holds that 

a-i(xi)''' ■^m'^(xi)'^' 

converges in distribution to the slope of the greatest convex minorant ofW{t) +t'^, 
where W is a two-sided Brownian motion. 



3. Algorithms for additive isotone regression 

The one-dimensional isotonic least squares estimator can be regarded as a projection 
of the observed vector (F^^^ , . • . , F^"-* ) onto the convex cone of isotonic vectors in M" 



184 



E. Mammen and K. Yu 



with respect to the scalar product (/, g) = X^^Li where / = {f^^\ . . . , /^"^) 

and g = (g^^-*, . . . , 5*-"'') G M" . Equivalcntly, we can regard it as a projection of a 
right continuous simple function with values {Y^^\ . . . , onto the convex cone 

(i) 

of right continuous simple monotone functions which can have jumps only at s. 
The projection is with respect to the L2 norm defined by the empirical measure, 
Pn{y,xi) which gives mass 1/n at each observations Other monotone 

functions m with m{X^-^^) = g*-*' would also solve the least square minimization. 

Now, we consider the optimization problem Without loss of generality, we 
drop the constant. Let Hj, j — 1, . . . , c? be the sets of isotonic vectors of length n 
or right continuous monotone simple functions which have jumps only at X^s with 
respect to the ordered Xj's. It is well known that these sets are convex cones. Then, 
our optimization problem can be written as follows: 



(5) min y (r* - 

i— 1 



We can rewrite ^ as an optimization problem over a product sets Hi x ■ • • x Hd- 
Say g = {gi,...,gd) G Hi x ■ ■ ■ x Hd where g^ e Hj for j = 1, . . . , d. Then 
the minimization problem ([5|) can be represented as minimizing a function over a 
cartesian product of sets, i.e., 

(6) min F(Y,g). 

Here, F(Y,g) = Er=i(^' - 5l - • • • 7 5^)'- 

A classical way to solve an optimization problem over product sets is a cyclic 
iterative procedure where at each step we minimize F with respect to one gj G Hj 

while keeping the other g^. G H^, j 7^ k fixed. That is to generate sequences g^ , 

r — 1,2, . . . , j — 1, . . . ,d, recursively such that minimizes F[y, gi , . . . , gjli, u, 

fj+i^'' ■ ■ ■ ^ 9d O'^sr u € Hj. This procedure for ^ entails the well known back- 
fitting procedure with isotonic regressions on Xj, Il{-\Hj) which is given as 



(7) 5r'=n Y-,M----5Pi-5fc^'-----5r' 



r = 1, 2, . . . , j = 1, . . . , d, with initial values gj"^ = where Y = (F^, . . . , Y"). For 

a more precise description, we introduce a notation Yj'^^^ = ~ g];^^^ • — g^'l'j' — 

^' ~ ' ' ' ~ 9^d^ ^' where g^''''' is the value of gk at XI, after the r-th iteration, 
i.e. y?''''! is the residual at the j-th cycle in the r-th iteration. Then, we have 



gf^ = maxminV Y;(')'t'^'/(< - s + 1). 



Here, 1^- is the residual at the fc-th cycle in the r-th iteration corresponding to 
theXf . 

Let g* be the projection of Y onto Hi -!'•■• + Hd, i.e., the minimizer of the 
problem ([5]). 

Theorem 2. The sequence, g^r,]) = J2i<k<j 9k^ +J2j<k<d9k^^^' converges to g* 
as r ^ oo for j — 1, . . . , d. Moreover if the problem ^ has a solution that is unique 
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up to an additive constant, say g* — (g*, . . . ,g*i), the sequences gp converge to a 
vector with constant entries as r ^ oo for j ~ 1, . . . , d. 

In general, g* is not unique. Let g = (gi, . . . , gd) be a solution of ([6]). Suppose, e.g. 
that there exists a tuple of non constant vectors (/i, . . . , fd) such that /{ + ■ ■ ■ + fd = 
for i — 1, . . . ,n and gj + fj are monotone. Then, one does not have the unique 

solution for ([6]) since ff] = + fj) ^^'^ 9j + fj monotone. This 

phenomenon is similar to 'concurvity', introduced in Buja et al. (1989). One simple 
example for non-uniqueness is the case that elements of X are ordered in the same 
way, i.e., Xj^ < Xj <^ X^ < X^ for any (p, q) and (j, k). For example when d — 2, 
if g solves ([5]), then {ag, (1 — a)g) for any a e [0, 1] solve Other examples occur 
if elements of X are ordered in the same way for a subregion. 

4. Simulations 

In this section, we present some simulation results for the finite sample performance. 
These numerical experiments are done by R on windows. We iterate 1000 times for 
each setting. For each iteration, we draw random samples from the following model 

(8) Y = mi{Xi) + m2{X2) + €, 

where {Xi,X2) has truncated bivariate normal distribution and e ~ A^(0, 0.5^). 

In Table [1] and [2l we present the empirical MISE (mean integrated squared 
error) of the backfitting estimator and the oracle estimator. We also report the 
ratio (B/0), MISE of the backfitting estimator to MISE of the oracle estimator. 
For Table [TJ we set rni(x) = and m2{x) — sin(7ra;/2). The results in Tabled] 
show that the backfitting estimator and the oracle estimator have a very similar 
finite sample performance. See that the ratio (B/0) is near to one in most cases 
and converges to one as sample size grows. We observe that when two covariates 
have strong negative correlation, the backfitting estimator has bigger MISE than 
the oracle estimator but the ratio (B/0) goes down to one as sample size grows. 
Figure [T] shows typical curves from the backfitting and oracle estimators for mi. We 
show the estimators that achieve 25%, 50% and 75% quantiles of the i2-distance 



Table 1 

Comparison between the backfitting and the oracle estimator: Model with mi(x) = , 
m2{x) = sin(7rx/2), sample size 200, 400, 800 and different values of p for covariate distribution 



n 


P 




mi 






m2 




Backfitting 


Oracle 


B/O 


Backfitting 


Oracle 


B/O 


200 





0.01325 


0.01347 


0.984 


0.01793 


0.01635 


1.096 




0.5 


0.01312 


0.01350 


0.972 


0.01817 


0.01674 


1.086 




-0.5 


0.01375 


0.01345 


1.022 


0.01797 


0.01609 


1.117 




0.9 


0.01345 


0.01275 


1.055 


0.01815 


0.01601 


1.134 




-0.9 


0.01894 


0.01309 


1.447 


0.02363 


0.01633 


1.447 


400 





0.00824 


0.00839 


0.982 


0.01068 


0.01000 


1.068 




0.5 


0.00825 


0.00845 


0.977 


0.01070 


0.01001 


1.063 




-0.5 


0.00831 


0.00830 


1.001 


0.01081 


0.00997 


1.084 




0.9 


0.00846 


0.00814 


1.040 


0.01092 


0.00997 


1.095 




-0.9 


0.10509 


0.00805 


1.305 


0.01311 


0.00992 


1.321 


800 





0.00512 


0.00525 


0.976 


0.00654 


0.00621 


1.053 




0.5 


0.00502 


0.00513 


0.977 


0.00646 


0.00614 


1.052 




-0.5 


0.00509 


0.00513 


0.994 


0.00660 


0.00620 


1.066 




0.9 


0.00523 


0.00500 


1.046 


0.00667 


0.00611 


1.091 




-0.9 


0.00603 


0.00498 


1.211 


0.00757 


0.00612 


1.220 
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between the backfitting and the oracle estimator for mi{x). We observe that the 
backfitting and the oracle estimator produce almost identical curves. 

Table [2] reports simulation results for the case that one component function is 
not smooth. Here, ■mi{x) = x,\x\ > 0.5; 0.5,0 < a; < 0.5; —0.5, —0.5 < a; < and 
'>Ti2{x) — sin(7ra;/2). Even in this case the backfitting estimator shows a quite good 
performance. Thus the oracle property of additive isotonic least square regression 



Table 2 

Comparison between the backfitting and the oracle estimator: Model 0) with 
mi{x) = X, \x\ > 0.5; 0.5, < a; < 0.5; —0.5, —0.5 < x < 0, m2(x) = sin(7ra;/2), sample size 200, 
400, 800 and different values of p for covariate distribution 



n 


P 




mi 






rn2 




Backfitting 


Oracle 


B/O 


Backfitting 


Oracle 


B/O 


200 





0.01684 


0.01548 


1.088 


0.01805 


0.01635 


1.104 




0.5 


0.01686 


0.01541 


1.094 


0.01756 


0.01604 


1.095 




-0.5 


0.01726 


0.01541 


1.120 


0.01806 


0.01609 


1.123 




0.9 


0.01793 


0.01554 


1.154 


0.01852 


0.01628 


1.138 




-0.9 


0.02269 


0.01554 


1.460 


0.02374 


0.01633 


1.454 


400 





0.01016 


0.00950 


1.071 


0.01094 


0.01014 


1.079 




0.5 


0.00987 


0.00944 


1.046 


0.01088 


0.01025 


1.062 




-0.5 


0.01010 


0.00944 


1.070 


0.01084 


0.00998 


1.086 




0.9 


0.01000 


0.00897 


1.115 


0.01105 


0.00996 


1.109 




-0.9 


0.01192 


0.00897 


1.330 


0.01308 


0.00996 


1.314 


800 





0.00576 


0.00552 


1.044 


0.00657 


0.00622 


1.056 




0.5 


0.00578 


0.00555 


1.041 


0.00651 


0.00617 


1.055 




-0.5 


0.00588 


0.00555 


1.059 


0.00657 


0.00614 


1.071 




0.9 


0.00598 


0.00538 


1.110 


0.00670 


0.00616 


1.088 




-0.9 


0.00695 


0.00538 


1.291 


0.00772 


0.00612 


1.262 




-1.0 -05 00 0.5 1 -1.0 -05 0.0 0.5 1 -10 -0.5 0.0 5 1.0 

XI XI XI 

Fig 1 . The real lines, dashed lines and dotted lines show the true curve, backfitting estimates and 
oracle estimates, respectively. The left, center and right panels represent fitted curves for the data 
sets that produce 25%, 50% and 75% quantiles for the distance between the backfitting and the 
oracle estimator in Monte Carlo simulations with p = 0.5 and 200 observations. 
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Appendix: Proofs 

A.l. Proof of Theorem[Ii 

The proof of Theorem [T] is divided into several lemmas. 
Lemma 3. For j ~ 1, . . . ,d it holds that 

sup \mj{uj) ~ mj{uj)\ = Op[{logn)n~^^^]. 

n-2/9<«j<l-„-2/9 

Proof. Because has subexponential tails (see (A3)) we get that supi<j<„ |e*| — 
Op{\ogn). This implies that maxi<j<c; suPq^^^^^j^ \fhj{uj)\ = Op(logn). We now 
consider the regression problem 

rV(logn) = c/(logn) + mi{Xl)/{logn) + ... + md(A^)/(logn) + £V(logn). 

Now, in this model the least squares estimators of the additive components are 
bounded and therefore we can use the entropy bound for bounded monotone func- 
tions (see e.g. (2.6) in van de Geer or Theorem 2.7.5 in van der Vaart and 
Wellner [iOl). This gives by application of empirical process theory for least squares 
estimators, see Theorem 9.2 in van de Geer [39[ that 

- J2 - ^li^i) + ■■■ + f^^diX'd) - = Op [(log n)2n-2/3]. 

i=l 

And, using Lemma 5.15 in van de Geer [3^ . this rate for the empirical norm can 
be replaced by the L2 norm: 



[TOi(?ii) - mi(Mi) + . . . + fhd{ud) - md{ud)f p{u) du = Op[(logn)^n ^/^] 
Because p is bounded from below (see (A2)) this implies 

[mi(wi) - mi(Mi) + . . . + fhd{ud) - md{ud)f du = Op[(logn)^n"^/^]. 



Because of our norming assumption ^ for fhj and m,j the left hand side of the last 
equality is equal to 

^ [mi(ui) - toi(mi)]^ dui + ... + j [fhd{ud) - T7id{ud)f dud- 

This gives 

(9) max / [%(uj) - mj{uj)f duj = Op[{lognfn-^/^]. 

i<3<dj 

We now use the fact that for j — 1, . . . ,d the derivatives m' are bounded. This 
gives together with the last bound the statement of Lemma [3l □ 
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We now define localized estimators fh'^l^^ and fhjjoc- They are defined as 
and fhj but now tlie sum of squares runs only over indices i with xj — (logn)^/'*' x 
^-2/(97)^^ < xij < Xj + {log ny^'^n~^^^^'^'>Cn, i.e. fh'^^^ minimizes 

J2 K-(Xj) + e'-mO«^(Xj)]' 

i:|X*-a;j|<(logn)i/Tn-2/(9"')c„ 

and fhj^ioc minimizes 



E 

i:|Xj-a;j|<(logn)i/Tn-2/(97)c„ 



Here c„ is a sequence with c„ — + 00 slowly enough (see below). We now argue that 

fhj^iocixj) — fhj(xj) for j — 1, . . . ,d and < Xj < 1 

(10) with probability tending to 1. 

This follows from Lemma [3l the fact that nij fulfills (Al) and the representation 
(compare (H])): 

^i:u<X^.<v^j 

(11) mj(a;j)= max 



(12) 



i:u<X^<v j 



fhj,ioc{xj) = max 

a:j-(logn)i/Tn-2/(97)c„<u<2:j 



rj<ti<Xj + (logn)i/7n-2/(97)c„ #{i : U < X ■ < "w} 



with Yj' = — X^/^iij ^li-^D- Here denotes the number of elements of a set A. 
Proceeding as in classical discussions of the case d—1 one gets: 

^j^g^ fhffocixj) = Wi'^la^j) for j = 1, . . . , d and < Xj < 1 

with probability tending to 1. 

We now consider the functions 

M,{u,,x,)^n-' J2 YJ-^"' E ^^ 

Mf''{u,,x,)^n-' ^ [m,{X])^e^]-n-^ ^ [m,(Xj)+e'], 

3—3 3—3 

Mj{uj,x,)^n-^ E E 

j — 3 J — 3 

For — (logn)-'^/'*'n~^/(^''''c„ < Uj < Xj + (logn)^/^n~^/'^^''')c„ we consider the 
functions that map #{z : < Uj} onto Mj{uj,Xj), Mj^^{uj,Xj) or Mj{uj,Xj), 
respectively. Then we get fhj^iocixj), fn-'jUci^j) and mj{xj) as the slopes of the 
greatest convex minorants of these functions at Uj — Xj . 
We now show the following lemma. 
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Lemma 4. For a > there exists a (3 > such that uniformly for 1 < I, j < d, 

0<Xj <1 and Xj - (logn)i/'^n-2/(9T)c„ < Uj < Xj + (logn)i/''n~2/(97)c„ 



(14) 
(15) 



M^''iuj,x,)-Mj{u,,xj) 



Op{{\u, 



M,{u,,Xj)-My^{u,,x,) 

^ E - 



i:X'^.<Xj 



[fhi{ui) - mi[ui)]pxi\x,['U'i\^]) dui 



+ Op{{\u, -Xj\+ 71-"}2/3n-13/27(iog„)/3)^ 



(16) 



E 



E 



[mi{ui) - mi{ui)]pxi\Xj{ui\X'^) dm 



n 



1 [Mi--X]<u,}-i^{i:Xi<x,}\ 

X j [fhi{ui) - mi{ui)]px,\x,{ui\xj) dui 

+ Opi{\uj -Xj\ + n-i}n-2p/(97)(iog„)/3). 

Proof. Claim (fT4|) is a standard result on partial sums. Claim p6|) directly follows 
from (A2). For a proof of claim p5p we use the following result: For a constant C 
suppose that A is a difference of monotone functions on [0, 1] with uniform bound 
sup^ |A(z)| < C and that Z^, . . . , Z'^ is a triangular array of independent random 
variables with values in [0, 1]. Then it holds uniformly over all functions A 

k 



i=l 



'1^{Z')-E[A{Z')] ^Op{k^'^) 
see e.g. van de Geer [3;3|- This result can be extended to 



I 

A{Z') - E[A{Z')] ^ Op{k' 

i=l 



uniformly for I < k and for A a difference of monotone functions with uniform 
bound sup^ |A(-2^)| < C- More strongly, one can show an exponential inequality for 
the left hand side. This implies that up to an additional log-factor the same rate 
applies if such an expansion is used for a polynomially growing number of settings 
with different choices of fc, Z^ and A. 

We apply this result, conditionally given Xj, . . . , X", with = X^ and A(u) = 
< u < 1 — n^^^^][fhi{u) — mi(u)]/(ri~^/^ logn). The last factor is justified 
by the statement of Lemma [3] This will be done for different choices of fc > n^~". 
Furthermore, we apply this result with — XI and A(m) = {/[O <u< n~^/^] + 
I[l — < u < l]}[?7i;(u) — m;(M)]/(logri) and k > n^^". This implies claim 

dlSl). □ 

We now show that Lemma U implies the following lemma. 
Lemma 5. Uniformly for I < j < d and n^^l'^ < xj < 1 — rt^^/'^ it holds that 



(17) m,(x,) = mf«(x,) 



-E 

1^3 



[fhi{ui) - mi[ui)]pxi\x,{ui\xj)dui + op[n ^/^) 
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and that with a constant c > uniformly for I < j < d and < Xj < n or 



1 - <Xj <1 

fhjixj) = m'j^{xj) J [mi{ui) - mi{ui)]px,\x,iui\xj)dui 

^^^^ +op(n"2/9(iog„)-). 

Proof. For a proof of ([TT]) wc use that for n^^/"^ < Xj < I — n^^l'^ 

(19) m,j{xj) < fhj{xj) < fh'j'{xj) with probabihty tending to 1, 

(20) mf^'~{xj) < mf^{xj) < m°^'+(a;j) with probabihty tending to 1, 

(21) sup m- '^{xj)-m,j (Xj) ^ op(n ' ), 

0<xj<l 



where 



m^{xj) — max 



- I IJ-lOjjV 111111 ■ 

■' Xj-e„<u<Xj-d„ Xj<v<Xj+e„ ^{i \ U < < v] 



Xj—ej 



nij [Xj) = max 



<u<Xj Xj+d„<v<Xj+e„ #{i : U < Xj < v}^ 



e„<u<Xj~d„ Xj<v<Xj+e„ ■ U < Xj < w} ' 



m ' [Xj) = max 



Xj—e„<u<Xj Xj+dn<v<Xj+en T^{i : U < X^ < w} ' 

compare (ITU) and (|12p . Here, e„ — {\ogny^'^n^^^^^''^Cn and c?„ is chosen as dn = 
n^^ with 1/3 < (5 < 4/9. Claims (dH) and (gOl) foUow immediately from the def- 



initions of the considered quantities and pOj) and p3|) . Claim (|2T|) can be estab- 
lished by using well known properties of the isotone least squares estimator. Now, 
(IISl),!!!!),!!!]) and ^ imply that uniformly for 1 < j < d and n^^/^ < xj < 
l-n-i/3 

fi^fi^j) = f^-f^'^i^j) ^ X! / ["^'(""0 ^ ^'!'i{ui)]px,\x,{ui\^3)d-ui + op(n^^/^). 

This shows claim (ITTl) . 

For the proof of (fT8)) one checks this claim separately for n^^/^(log n)^^ < 
Xj < n^^/'^ or 1 — n~^/'^ < Xj < 1 ~ n~^/^(log n)^^ (case 1) and for < Xj < 
n~^/^(logn)~^ or 1 — n~^/^(logn)^^ < Xj < 1 (case 2). The proof for Case 1 
is similar to the proof of pT|) . For the proof in Case 2 one considers the set 
— {i '■ < Xj < n^^/^(logn)^^}. It can be easily checked that with prob- 
ability tending to 1 it holds that n^^/^ < < I ~ n^^/^. Therefore it holds 
that supjgj^. ^ \fhi{Xi) — mi{Xi)\ — Op[{\ogn)n^^^^], see LemmaO Therefore for 
< Xj < n~^/^(logn)^^ the estimators fhj{xj) and fh'^^{xj) are local averages of 
quantities that differ by terms of order Op[(logn)n^^/^]. Thus also the difference of 
fhj{xj) and m^^(xj) is of order Op [(log n)n^^/^]. This shows (fT8)) for Case 2. □ 

We now show that Lemma [5] implies the statement of the theorem. 
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Proof of Theorem We rewrite equations pT)) and ^8]) as 
(22) m = + H{m - m) + A, 



where m, m*^^ and A are tuples of functions fhj, fh'^^ or Aj, respectively. For A 



it holds that 

(23) sup \l^j{xj)\^ op{n-^'^), 

r!,-l/3<2;^.<l_„-l/3 

(24) sup |A,(x,)| =op(n-2/9(logn)-). 

0<a:j<l 

Furthermore, H is the linear integral operator that corresponds to the linear map 
in ([T7|l and (fT8| . For function tuples / we denote by N f the normalized function 
tuple with {N f)j{xj) ~ fj{xj) — J fj{uj)duj and we introduce the pseudo norms 

|2 



Il/ll2= J [fl{xi) + ... + fd{xd)Vp{x)dx, 

||/IU=max sup \fj{xj)\. 

^<3<do<Xj<l 

Here pj is the marginal density of Xj and p is the joint density of X^. We make use 
of the following properties of H. On the subspace To — {f : f = Nf} the operator 
H has bounded operator norm: 

(25) sup ||ff/||2 = 0(l). 

/e:Po,ll/l|2=i 

For the maximal eigenvalue A„iax of H, it holds that 

(26) A,„ax < 1. 

Claim (1^51) follows directly from the boundedness of_£. Claim can be seen as 
follows. Compare also with Yu, Park and Mammen 42l |. 
A simple calculation gives 

(27) J {mi{ui) + ■ ■ ■ + md{uii))'^p{u) du = \\m\\2 ^ J rn^ [I — H)m{u)p{u)du. 

Let A be an eigenvalue of H and m\ be an eigen(function)vector corresponding to 
A. With ([271), we have 



II™a||2 = / ™A (-^ ~ H)mx{u)p{u)du = (1 - A) / m^mx{u)p{u)du 



Thus, the factor 1 — A must be strictly positive, i.e. A < 1. This implies I ~ H \s 
invertible and hence we get that 

N{m ~m) = {I - H)-^N{fff^ - m) + (/ - H)-^NA. 

Here we used that because of (|22]) 



N{m - ni) = N{fff^ - m) - NH{fh - m) + NA 
= N{m^^' - to) - HN{m - to) + A^A. 
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We now use 



(/ - H)-^ =I + H + H{I - H)-^H, 
(I - H)-^ =1 + H{I - H)-\ 



This gives 



N{m - m) = N{fh°^ - m) + 7VA + HN{fh°" - m) 

+ H{I - Hy^HN{m^^ - to) + H{I - Hy^A. 

We now use that 

(28) \\HN{m'^'^ - to)||2 < \\HN{m^^ - m)\\oo - op{n~^/^), 

(29) sup ||i//l|oo = 0(l). 
/e:Po,ll/l|oo=i 

Claim (I28p follows because rh'-'^ is a local average of the data, compare also Groene- 
boom [12], Groeneboom, Lopuhaa and Hooghiemstra ^Tl\ and Durot 8]. Claim ([29|) 
follows by a simple application of the Cauchy Schwarz inequality, compare also (85) 
in Mammon, Linton and Nielsen [28| . 
This implies that 

\\N{m ~m)~ N{m°^ - m) - NA\\^ = op{n-^/^). 

Thus, 

sup \N{m ~ m°^)j(xj)\ = opirT^I^), 

Ti-i/3<Xj<l-n-i/3 

sup \N(m - m°^)yxj)\ ^ op(n-2/9(log n)'=) 

0<2;j<l 

This implies the statement of Theorem [1] □ 



A. 2. Proof of Theorem\M 



For a given closed convex cone K, we call K* = {/ : {f,g) < for all g £ K} 
the dual cone of K. It is clear that K* is also a convex cone and K** — K. It is 
pointed out in Barlow and Brunk 3] that if P is a projection onto K then / — P is a 
projection onto K* where / is the identity operator. Let Pj be a projection onto Hj 
then P* = I — Pj is a projection onto H* . The backfitting procedure ([7]) to solve 
the minimization problem ([5]) corresponds in the dual problem to an algorithm 
introduced in Dykstra 9]. See also Gaffke and Mathar ,11]. We now explain this 
relation. Let Hj, j = 1, . . . ,d, be sets of monotone vectors in M" with respect to 
the orders of Xj and Pj = Il{-\Hj). Denote the residuals in algorithm ([7]) after the 
fc-th cycle in the r-th iteration with h(^r,k)- Then, we have 

/i(i4)=Y-,gW =P*Y, 

/i(i,2) = Y - ,gW - .gW = P*Y - P^P^Y = P^P^Y, 
(30) = Y - ,gW ,gW = P^ ■ ■ ■ P* Y; 
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"■(r,i) — ^ ^ 5i ^ .92 ^ 9d — ^1 ^ 92 ^ 9d > 



n'(r,k) - I - 5i 5fc - 5d 

-^A;I>I 9l 9k-l 9k+l 9d )^ 



(31) = Y - .gM = P;(Y - .g" J 



With the notation Ir.k = ^9k incremental changes at the fc-th cycle in 

the r-th iteration, equations (f30|) and (l3T|) form a Dykstra algorithm to solve the 
following optimization problem: 

n 

(32) min V(y'' - /i*)2. 

Denote the solutions of ((32|) with ft,*. Theorem 3.1 of Dykstra P] shows that 
converges to /i* as r — > oo for = 1, . . . , d. From the dual property, it is well 
known g* — Y ~ h* and also it is clear that g{r,j) = Y — ft.(rj) for j — 1, . . . , d. 
Since ft(rj) converges to /i*, 5(rj) converge to 5* as r oo for j — 1, . . . , d. The 

\r] 

convergence of follows from Lemma 4.9 of Han 17]. 
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